Chapter 4: Regression–Exploring Association Between Variables

COR 142 C

Author

Dr. Kao

Section 4.1. Visualizing Variability With a Scatterplot

NoteScatterplot
  • The primary tool for examining the relationship between two numerical variables.
  • Each point in a scatterplot represents a single observation.
TipExample 4.1.1.

What does the scatterplot below tell you about the relationship between the number of floors in a building and its height?

In general, the height of a building tends to increase as its number of floors increases.

NoteExamining Scatterplots

There are three features of a scatterplot we will focus on.

  • Trend: Finding the “center” of the scatterplot from left to right
  • Strength: How “spread out” the points are in the plot
  • Shape: Linear or nonlinear
NoteShape
  • Linear: Scatterplots that cluster around a line model linear trends.
  • Nonlinear: Not all trends are linear. We model nonlinear data using a curve rather than a line.
NoteTrend
  • The general tendency of the scatterplot as you read from left to right.
  • The typical trends are:
    1. Increasing, this is called a positive association.
    2. Decreasing, this is called a negative association.
    3. No trend, if there is neither a positive nor a negative tendency.
NoteStrength of an Association
  • Scatterplots with a small amount of scatter (little vertical variation) indicate a strong association.
  • Scatterplots with a large amount of scatter (large vertical variation) indicate a weak association.
NoteWriting Descriptions of Associations

When writing a description of an association between two numerical variables, always include the following:

  1. trend (positive, negative, or no trend)
  2. strength (strong or weak)
  3. shape (linear or nonlinear)
  4. any observations that do not fit the general trend
NoteBe Careful Describing Association
  • Always use a phrase such as “tends to” when describing an association, because every trend has variability. The association you describe may not hold for every individual observation.
  • Always point out any data points that appear to be unusual or not part of the general pattern.

Section 4.2. Measuring Strength of Association with Correlation

NoteCorrelation Coefficient
  • A number that measures the strength of a linear relationship.
  • We use the letter \(r\) to denote the correlation coefficient.
  • \(r\) always takes a value between \(-1\) and \(+1\), i.e., \(-1 \leqslant r \leqslant 1\).
  • An \(r\)-value close to \(-1\) or \(+1\) indicates a strong linear association.
  • An \(r\)-value close to 0 indicates a weak linear association.
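
As a rough illustration of how such a number can be computed, here is a minimal Python sketch of the standard formula for \(r\): the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the sample data are made up for illustration and do not come from the chapter.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, chosen only to show the range of r
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # points exactly on a rising line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # points exactly on a falling line
r_weak = pearson_r(x, [3, 1, 4, 1, 5])   # widely scattered points
print(r_up, r_down, r_weak)
```

The first two results sit at the ends of the allowed range, while the scattered data give a value much closer to 0.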

TipExample 4.2.1.

Consider the following scatterplot, shown with its correlation coefficient \(r\). Use the slider to adjust the correlation coefficient of the scatterplot. What can you say about the relationship between \(r\) and the vertical variation of the data?

As \(r\) gets closer to \(+1\) or \(-1\), there is less vertical variation in the data. In other words, the trend gets stronger for values of \(r\) close to \(+1\) or \(-1\).

ImportantImportant Notes about the Correlation Coefficient
  1. Changing the order of the variables does not change \(r\).
  2. Adding a constant to every value of a variable, or multiplying every value by a positive constant, does not change \(r\).
  3. \(r\) is unitless.
  4. \(r\) is only useful to measure a linear trend. Always graph your data first before computing \(r\) to make sure the association is linear.
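
The first three notes can be checked numerically. The sketch below uses made-up height and weight values (not data from the chapter) and a helper named `pearson_r` that implements the standard formula for \(r\); it swaps the variables, shifts one of them, and converts units, then compares the results.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds)
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 160, 180]

r_original = pearson_r(heights_in, weights_lb)
# Note 1: swap which variable is x and which is y
r_swapped = pearson_r(weights_lb, heights_in)
# Note 2: add a constant to every height
r_shifted = pearson_r([h + 10 for h in heights_in], weights_lb)
# Notes 2 and 3: change units (inches to cm, pounds to kg)
r_rescaled = pearson_r([h * 2.54 for h in heights_in],
                       [w * 0.4536 for w in weights_lb])

print(r_original, r_swapped, r_shifted, r_rescaled)
```

All four printed values should agree up to floating-point rounding.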

Sections 4.1, 4.2 Examples

TipExample 1.

Below is a scatterplot showing the weights of 36 cars versus their horsepower. What can you say about the trend in the plot? Describe the relationship between the weight of the cars and their horsepower.

The scatterplot shows a positive trend because the graph increases from left to right. This means as the weight of the car increases, the horsepower tends to increase also.

TipExample 2.

Below is a scatterplot of the weight of 36 cars vs their mileage (mpg). What can you say about the trend in the plot? Describe the relationship between the weight of the cars and their mileage.

The scatterplot shows a negative trend because the graph decreases from left to right. This means that as the weight of the car increases, the mileage tends to decrease.

TipExample 3.

Below is a scatterplot of heart rate (bpm) vs. body temperature (F). What can you say about the trend in the plot? Describe the relationship between heart rate and body temperature.

The scatterplot shows no trend because the points follow no predictable pattern. This suggests that there is no apparent association between heart rate and body temperature.

TipExample 4.

Consider the scatterplot shown below. This data set shows an association between the \(x\)- and \(y\)-variables, but it cannot be characterized as positive or negative. It is certainly ______.

Consider the scatterplot shown below. This data set shows an association between the \(x\)- and \(y\)-variables, but it cannot be characterized as positive or negative. It is certainly nonlinear.

TipExample 5.

Consider the two scatterplots below. One shows height (in) versus weight (lb), and the other shows waist size (in) versus weight (lb). Compare the strength of the associations in these two plots.

There is a stronger association between waist size and weight (less vertical variation in the graph) than between height and weight.

TipExample 6.

The scatterplot below shows snout vent length (in) vs weight of alligators. What kind of relationship can you conclude from the plot?

The relationship between the snout vent length and the weight of alligators is strong, positive, and linear.

TipExample 7.

The table below shows the number of miles driven on a recent trip for 8 rented cars and the amount of gas (in gallons) used for each trip.

  1. Use the built-in app to compute and interpret the correlation coefficient, \(r\).
  2. What can you say about the relationship between miles driven and fuel consumption?

| Miles Driven               | 200  | 150  | 110 | 60   | 350   | 225  | 400  | 300  |
| Fuel Consumption (gallons) | 11.1 | 5.37 | 4.4 | 1.71 | 11.67 | 7.76 | 12.5 | 9.09 |

  1. According to the app, the calculated \(r\)-value is 0.9055.
  2. Since \(r\) is close to \(+1\), there is a strong, positive linear association between miles traveled and fuel consumption.
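
If the app is not at hand, the same value can be approximated with a few lines of Python. This is only a sketch of the standard formula for \(r\) applied to the table above; the helper name `pearson_r` is an assumption, and the app may round or truncate its result differently.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Data from the table in Example 7
miles = [200, 150, 110, 60, 350, 225, 400, 300]
gallons = [11.1, 5.37, 4.4, 1.71, 11.67, 7.76, 12.5, 9.09]

r = pearson_r(miles, gallons)
print(f"r = {r:.4f}")
```

The printed value should agree with the app's result to about three decimal places.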


TipExample 8.

Match each scatterplot to its closest \(r\)-value in the table below.

\(0.885\) \(0.724\) \(-0.341\) \(0.119\)

Reading the plots from left to right:

  • The 4th plot shows a negative trend, so its \(r\)-value must be the only negative value: \(r = -0.341\).
  • The 2nd plot shows the strongest positive association, so its \(r\)-value must be the closest to \(+1\): \(r = 0.885\).
  • The 3rd plot also shows a positive trend, but its association is not as strong as that of the 2nd plot, so \(r = 0.724\).
  • This means the 1st plot must have \(r = 0.119\). This choice is justified, since the plot shows very little positive trend and nearly no association.

TipExample 9.

Suppose a new data point has been added to the table in Example 7, giving the new scatterplot shown below.

  a. Describe and interpret the association.
  b. What do you think happened with the one point that does not follow the trend? Does this value seem accurate?
  c. If the potential outlier in the graph is corrected or removed, what do you think will happen to the \(r\)-value?

a. There seems to be a strong, positive linear association between miles driven and fuel consumption. There is, however, an unusual data point that does not follow the trend.
b. It could be a typographical error, or it could be an electric/hybrid vehicle. The value is plausible if the vehicle is electric or hybrid.
c. If the outlier is corrected or removed, the \(r\)-value will likely increase (move closer to \(+1\)), because the remaining points follow a strong positive linear trend.
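
The effect described in part c can be illustrated numerically. The added point is not listed in the text, so the sketch below uses a purely hypothetical ninth observation (380 miles driven on 2.0 gallons) appended to the Example 7 data; the helper name `pearson_r` is also an assumption.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Data from the table in Example 7
miles = [200, 150, 110, 60, 350, 225, 400, 300]
gallons = [11.1, 5.37, 4.4, 1.71, 11.67, 7.76, 12.5, 9.09]

# HYPOTHETICAL outlier: a long trip that used very little gas
outlier_miles, outlier_gallons = 380, 2.0

r_without = pearson_r(miles, gallons)
r_with = pearson_r(miles + [outlier_miles], gallons + [outlier_gallons])

print(f"r without the outlier: {r_without:.3f}")
print(f"r with the outlier:    {r_with:.3f}")
```

With this made-up point, a single unusual observation is enough to pull \(r\) well below its original value.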

TipExample 10.

Both scatterplots below assess average miles per gallon fuel consumption for passenger cars. Both depict data from the same sample of vehicles.

  a. What is the approximate average mpg for a car with a cab space of 120 cubic feet?
  b. What is the approximate average mpg for a car with a horsepower of 150?
  c. Are these associations positive or negative, or is there no trend?
  d. Which do you think has a stronger relationship with average mileage: the car’s horsepower or its cab space? Why?
  e. If you wanted to predict the average mileage for a car, would you be able to make a better prediction by knowing its horsepower or its cab space in cubic feet?

a. Using the blue regression line as a guide, a car with 120 cubic feet of cab space has an average mileage of roughly 26 mpg.
b. Using the red regression line as a guide, a car with 150 horsepower has an average mileage of roughly 28 mpg.
c. Both lines have negative slopes, so both associations are negative.
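d. Judge strength by how closely the points cluster around each line rather than by slope alone, since slope depends on the units of the variables. By that measure, the association between horsepower and average mileage is the stronger one.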
e. Because the relationship between horsepower and average mileage is stronger, one can make better predictions of a car’s average mileage by knowing its horsepower rather than its cab space.

Section 4.4. Evaluating the Linear Model

CautionCautionary Notes Regarding Linear Regression
  • Do not use a linear model to describe nonlinear associations.
  • Correlation is not causation: Association between two variables is not sufficient evidence to conclude that a cause-and-effect relationship exists between the variables.
  • Beware of outliers: Outliers can have a big effect on \(r\). Always check the scatterplot for outliers first.
  • Do not extrapolate: Do not make predictions beyond the range of the observed data. We can never be certain a linear trend will continue beyond the range of the data.
NoteCoefficient of Determination
  • The coefficient of determination is \(r^2\), where \(r\) is the correlation coefficient.
  • The \(r^2\)-value is usually converted to a percentage, so it is always between 0% and 100%.
  • It measures how much variation in the response variable (\(y\)) is explained by the explanatory variable (\(x\)).
  • The larger \(r^2\) is, the less variation (scatter) there is about the regression line.

Sections 4.3, 4.4 Examples

TipExample 1.

Below is a scatterplot showing the relationship between mileage (in thousands of miles) and the price of used cars (in thousands of dollars).

  1. The scatterplot shows a _______ trend. As the mileage on a car increases, its sales price _______.
  2. Suppose the equation of the regression line is \[ \text{Price} = 15{,}867 - 34 \text{ Miles} \] Use the regression equation to predict the sales price of a car with 62,000 miles on it.

  1. The scatterplot shows a negative trend. As the mileage on a car increases, its sales price tends to decrease.
  2. First convert 62,000 miles into \(62000/1000 = 62\) since the \(x\)-axis is measured in thousands of miles. Then substitute this into the “Miles” variable of the regression equation: \[ \text{Price} = 15867 - 34 \times 62 = \$13,759. \]
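
The substitution in part 2 can be double-checked with a short Python sketch. The function name `predict_price` is made up for illustration, and the code assumes, as the worked answer does, that the equation returns a price in dollars when Miles is given in thousands.

```python
def predict_price(odometer_miles):
    """Predicted price from the regression equation Price = 15867 - 34 * Miles.

    Miles in the equation is measured in thousands, so the odometer
    reading is divided by 1000 before substituting.
    """
    miles_in_thousands = odometer_miles / 1000
    return 15867 - 34 * miles_in_thousands

predicted = predict_price(62_000)
print(f"Predicted price: ${predicted:,.0f}")
```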
TipExample 2.

It was found that there was a strong relationship between the heights and weights in a group of six women. The regression equation was given as \[ \text{Weight} = -442.882 + 9.029 \text{ Height}. \] Interpret the \(y\)-intercept, if appropriate.

The \(y\)-intercept in this instance is \(-442.882\). This implies that a woman would weigh \(-442.882\) lb if her height were 0 in, which is not physically meaningful, so it is not appropriate to interpret the \(y\)-intercept here.

TipExample 3.

The scatterplot and regression line show the relationship between population size and number of fatal car accidents for all 50 states. Use the data provided to answer the questions below.

  a. Identify the dependent and independent variables.
  b. Is this a positive or negative correlation?
  c. Describe the association between the variables.
  d. Interpret the slope and the \(y\)-intercept, if appropriate.

Simple Linear Regression Results
Dependent Variable: Fatal Crashes
Independent Variable: Population
Regression Equation: \(\text{Fatal Crashes} = 77.073 + 0.0000826 \text{ Population}\)
Sample Size: 51
Correlation Coefficient (\(r\)): 0.93228819
R-Square: 0.86916127

a. The dependent (\(y\)-) variable is the number of fatal crashes; the independent (\(x\)-) variable is population.
b. The regression equation has a positive slope of \(m = 0.0000826\), so the association is positive.
c. As the population increases, the number of fatal crashes tends to increase.
d. Interpretation of the slope and the \(y\)-intercept:
  • Slope: For every additional 1 person in the population, the predicted number of fatal crashes increases by 0.0000826.
  • \(y\)-intercept: The \(y\)-intercept is 77.073. This means that a state with a population of zero would be predicted to have 77.073 fatal crashes, which is not physically possible, so the intercept should not be interpreted.
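
Because the slope is so small per person, it can help to restate it per 100,000 people. The sketch below does that arithmetic with the regression equation from the results table; the function name and the population of 5,000,000 are hypothetical and chosen only for illustration.

```python
INTERCEPT = 77.073
SLOPE = 0.0000826  # predicted additional fatal crashes per additional person

def predict_fatal_crashes(population):
    """Predicted fatal crashes from the Example 3 regression equation."""
    return INTERCEPT + SLOPE * population

# Restate the slope for a more readable unit of population
per_100k = SLOPE * 100_000
print(f"About {per_100k:.2f} more predicted fatal crashes per 100,000 people")

# HYPOTHETICAL state population, used only to show a prediction
example_prediction = predict_fatal_crashes(5_000_000)
print(f"Predicted fatal crashes: {example_prediction:.1f}")
```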

TipExample 5.

Suppose there is a strong linear relationship between women’s heights and their weights, with a regression equation given below: \[ \text{Weight} = -442.882 + 9.029 \text{ Height} \] What weight does this equation predict for a woman who is 36 inches tall?

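Substitute 36 into Height in the equation: \[ \text{Weight} = -442.882 + 9.029 \times 36 = -117.838 \text{ lb} \] This is not possible because a person cannot have a negative weight. A height of 36 inches is far below typical adult heights, so this prediction is an extrapolation and should not be trusted.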
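
The same substitution can be sketched in Python. The function name `predict_weight` and the comparison height of 65 inches are assumptions added for illustration; the chapter does not state the range of heights in the data.

```python
def predict_weight(height_in):
    """Predicted weight (lb) from Weight = -442.882 + 9.029 * Height."""
    return -442.882 + 9.029 * height_in

# Height from Example 5
weight_at_36 = predict_weight(36)
# HYPOTHETICAL height that is plausible for an adult woman
weight_at_65 = predict_weight(65)

print(f"Predicted weight at 36 in: {weight_at_36:.3f} lb")
print(f"Predicted weight at 65 in: {weight_at_65:.3f} lb")
```

The equation gives a reasonable weight at 65 inches but an impossible one at 36 inches, which is what extrapolating far outside sensible heights looks like in practice.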

TipExample 6.

Suppose the data on a car’s age and its predicted monetary value produce a correlation coefficient of \(r = -0.778\). Compute and interpret the coefficient of determination.

Recall that the coefficient of determination is defined as \(r^2\), where \(r\) is the correlation coefficient. Therefore, \[ r^2 = (-0.778)^2 = 0.605284 \approx 60.53\% \] This means that approximately 60.53% of the variation in a car’s predicted value can be explained by its age.
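
The arithmetic can be reproduced in a couple of lines of Python; the variable names are arbitrary.

```python
r = -0.778            # correlation coefficient from Example 6
r_squared = r ** 2    # coefficient of determination

print(f"r^2 = {r_squared:.6f}, or about {r_squared:.2%}")
# prints: r^2 = 0.605284, or about 60.53%
```

Squaring removes the sign, so \(r = 0.778\) would give the same coefficient of determination.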

Chapter 4 Quick Review

TipQuestion 1.

The scatterplot below shows what type of relationship between the number of runs scored and the number of hits for a group of baseball players?

There is a moderately strong, positive linear relationship between the number of runs scored and the number of hits.

TipQuestion 2.

There is a negative association between the percentage of smoke-free homes and the percentage of high school students who smoke. This means what exactly?

This means that as the percentage of smoke-free homes increases, the percentage of high schoolers who smoke tends to decrease.

TipQuestion 3.

For a certain group of cars, there is a strong association between the city and the highway mileage that can be described by the following regression equation: \[ \text{Predicted Hwy MPG} = 8.25 + 0.87 \text{ City MPG} \] Interpret the slope.

The slope in this instance is \(m = 0.87\). This means that for every additional one mile per gallon in city mileage, the predicted highway mileage increases by 0.87 mpg.
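
One way to see this interpretation is to compare predictions for two cars whose city mileage differs by exactly 1 mpg. The city values of 20 and 21 mpg below are hypothetical, as is the function name.

```python
def predict_hwy_mpg(city_mpg):
    """Predicted highway mpg from Predicted Hwy MPG = 8.25 + 0.87 * City MPG."""
    return 8.25 + 0.87 * city_mpg

# HYPOTHETICAL city mileages that differ by 1 mpg
low = predict_hwy_mpg(20)
high = predict_hwy_mpg(21)
difference = high - low

print(f"Predicted highway mpg: {low:.2f} and {high:.2f}")
print(f"Difference: {difference:.2f} mpg")
```

Whatever pair of values 1 mpg apart is chosen, the difference in predictions equals the slope.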

TipQuestion 4.

Which of the following values of correlation coefficient indicates the strongest association between two variables? Why? \[ 0.32, \; 0.62, \; 0.78, \; -0.98 \]

The correlation coefficient \(r\) indicates the strength of the linear association between two variables; the closer \(r\) is to either \(+1\) or \(-1\), the stronger the relationship. In this case, \(r = -0.98\) indicates the strongest association because it is the value closest to \(-1\) or \(+1\).
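
Put another way, strength is compared using the absolute value of \(r\), and the sign only gives the direction. A short Python sketch of that comparison:

```python
candidates = [0.32, 0.62, 0.78, -0.98]

# Strength depends on |r|; the sign only indicates direction
strongest = max(candidates, key=abs)
print(strongest)  # -0.98
```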

TipQuestion 5.

When doing a regression analysis on a data set, which of the following remain the same no matter which variable is chosen for \(x\) and which is chosen for \(y\)?

  a. The \(x\)-intercept of the regression equation
  b. The slope of the regression equation
  c. The correlation coefficient
  d. All of the above

The correct answer is c. Of the choices listed, only the correlation coefficient (\(r\)) is unchanged when the roles of \(x\) and \(y\) are swapped.
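
This can be checked with the miles and fuel data from Example 7 in Sections 4.1, 4.2. The sketch below assumes the standard least-squares slope formula (sum of products of deviations divided by the sum of squared deviations of the explanatory variable), which this chapter does not state explicitly; the function name is made up.

```python
from math import sqrt

def slope_and_r(xs, ys):
    """Least-squares slope for predicting ys from xs, and the correlation r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx            # depends on which variable plays the role of x
    r = sxy / sqrt(sxx * syy)    # symmetric in x and y
    return slope, r

miles = [200, 150, 110, 60, 350, 225, 400, 300]
gallons = [11.1, 5.37, 4.4, 1.71, 11.67, 7.76, 12.5, 9.09]

slope_gallons_on_miles, r_xy = slope_and_r(miles, gallons)
slope_miles_on_gallons, r_yx = slope_and_r(gallons, miles)

print(f"slopes: {slope_gallons_on_miles:.4f} vs {slope_miles_on_gallons:.4f}")
print(f"r:      {r_xy:.4f} vs {r_yx:.4f}")
```

The two slopes differ by orders of magnitude, while the two values of \(r\) match.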